JCO Clinical Cancer Informatics
● American Society of Clinical Oncology (ASCO)
Preprints posted in the last 90 days, ranked by how well they match JCO Clinical Cancer Informatics's content profile, based on 18 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.
Petalcorin, M. I. R.
Background: Early-phase oncology development increasingly depends on integrated interpretation of clinical outcomes, translational biomarkers, and pharmacokinetic exposure rather than toxicity alone. This shift has created a need for reproducible analytical workflows that can combine heterogeneous trial data into traceable, analysis-ready outputs suitable for exploratory review and early decision support. Objective: To develop a reproducible Python-based workflow that simulates a plausible early-phase oncology study, integrates clinical, biomarker, and pharmacokinetic data, and generates analysis-ready datasets, visual summaries, and exploratory predictive models relevant to early development analytics. Methods: A workflow was constructed to simulate an early-phase oncology cohort of 120 patients distributed across multiple dose levels. Three synthetic raw data sources were generated, including patient-level clinical data, baseline biomarker data, and longitudinal pharmacokinetic profiles. These sources were merged into a single analysis-ready dataset containing derived variables such as tumor percent change from baseline, clinical-benefit status, exposure summaries, adverse-event indicators, and survival outcomes. The workflow produced structured tables, patient listings, waterfall plots, Kaplan-Meier-style survival curves, biomarker-response visualizations, pharmacokinetic profile plots, and exploratory machine-learning outputs. Results: The final integrated dataset contained 120 patients and 30 variables. Median survival across the simulated cohort was 243.8 days, and higher dose groups showed improved median survival and greater clinical benefit relative to the low-dose group. Clinical benefit increased from 8.6% in the low-dose group to 29.0% in the medium-dose group and 45.2% in the high-dose group. 
Higher baseline LDH, CRP, and ctDNA fraction tracked with less favorable tumor-response trajectories, whereas higher exposure, reflected by AUC and Cmax, associated with improved disease control. Pharmacokinetic profiles showed clear dose-dependent separation. Grade 3 or higher adverse-event rates remained within a plausible exploratory range across dose groups. A random-forest model for clinical benefit achieved an exploratory ROC AUC of 0.845, while a logistic-regression model for strict responder status could not be fit because no simulated patient met the prespecified objective response threshold. Conclusions: This proof-of-concept demonstrates that a transparent Python workflow can generate a coherent early-phase oncology analytical ecosystem from synthetic inputs. The workflow supports integration of heterogeneous data streams, derivation of analysis-ready variables, production of interpretable outputs, and exploratory modeling in a reproducible framework. Although the simulated responder prevalence was too low to support objective response modeling, this limitation itself highlights the importance of simulation calibration for downstream analytical validity. The framework provides a practical Health Informatics demonstration of how early oncology trial data can be structured and analyzed for exploratory translational decision support.
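The merge-and-derive step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the field names (`baseline_mm`, `best_mm`, `ldh`), the tiny example records, and the helper names `trapezoid_auc` and `build_analysis_dataset` are hypothetical and are not taken from the preprint, which does not publish its schema here. The trapezoidal rule is one common way to summarize exposure as AUC; the preprint may compute it differently.

```python
from collections import defaultdict

# Hypothetical raw sources; every value below is invented for illustration.
clinical = [
    {"patient_id": "P001", "dose_group": "low", "baseline_mm": 50.0, "best_mm": 48.0},
    {"patient_id": "P002", "dose_group": "high", "baseline_mm": 60.0, "best_mm": 45.0},
]
biomarkers = [
    {"patient_id": "P001", "ldh": 310.0},
    {"patient_id": "P002", "ldh": 190.0},
]
pk_samples = [  # (patient_id, hours post-dose, concentration)
    ("P001", 0.0, 0.0), ("P001", 1.0, 4.0), ("P001", 4.0, 2.0),
    ("P002", 0.0, 0.0), ("P002", 1.0, 12.0), ("P002", 4.0, 6.0),
]


def trapezoid_auc(points):
    """Area under a concentration-time profile by the linear trapezoidal rule."""
    pts = sorted(points)
    return sum((t1 - t0) * (c0 + c1) / 2.0 for (t0, c0), (t1, c1) in zip(pts, pts[1:]))


def build_analysis_dataset(clinical, biomarkers, pk_samples):
    """Merge the three sources on patient_id and add derived variables."""
    profiles = defaultdict(list)
    for pid, hours, conc in pk_samples:
        profiles[pid].append((hours, conc))
    biomarker_by_id = {row["patient_id"]: row for row in biomarkers}

    dataset = []
    for row in clinical:
        pid = row["patient_id"]
        merged = {**row, **biomarker_by_id.get(pid, {})}
        # Tumor percent change from baseline (negative means shrinkage).
        merged["pct_change"] = 100.0 * (row["best_mm"] - row["baseline_mm"]) / row["baseline_mm"]
        profile = profiles.get(pid, [])
        merged["auc"] = trapezoid_auc(profile)
        merged["cmax"] = max((conc for _, conc in profile), default=None)
        dataset.append(merged)
    return dataset


dataset = build_analysis_dataset(clinical, biomarkers, pk_samples)
for row in dataset:
    print(row["patient_id"], row["dose_group"], row["pct_change"], row["auc"], row["cmax"])
```

Keeping each derived variable in one function makes the path from raw record to analysis-ready column easy to audit, which is the traceability property the abstract emphasizes.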
Dickerson, J. C.; McClure, M. B.; Shaw, M.; Reitsma, M. B.; Dalal, N. H.; Kurian, A. W.; Caswell-Jin, J. L.
Background: Manual chart abstraction is a major bottleneck in clinical research. In oncology, important outcomes such as disease recurrence and the treatment history are often only documented in clinical notes, limiting the scale and quality of observational and epidemiologic studies. We developed an open-source pipeline that, in a HIPAA-compliant setting, can use any commercially available large language model (LLM) to determine whether variables from complex longitudinal oncology records can be abstracted with performance similar to that of expert medical oncologists. Methods: We randomly selected 100 patients from an institutional breast cancer cohort enriched for complex care. We abstracted a range of key variables from unstructured data, including dates of diagnosis and recurrence, clinical stage, biomarker subtype, genetic testing results, and prescribed systemic therapies, including treatment timing, intent, and reason for discontinuation. The inputs to the LLM were unnormalized, unlabeled, and unedited clinical notes, pathology reports, med admin records, and demographics. Breast oncologists abstracted the same variables to create the reference standard. For systemic therapy extraction, a second oncologist and research coordinators served as comparators. In addition to variable-level performance, we examined whether survival and hazard-ratio estimates were similar for fully LLM-derived datasets compared with expert-derived datasets. Results: Among 100 patients, the median chart had more than 3,100 pages of text; patients received a median of 7 lines of therapy over 6.5 years of follow-up. The best-performing LLM achieved 99% concordance with the expert for recurrence status, 100% for germline BRCA1/2 pathogenic variant detection, 99% for hormone receptor status, 96% for HER2 status, 91% for clinical stage, 91% for PIK3CA mutation status, and 90% for ESR1 mutation status. 
For anti-cancer drug extraction, the best-performing LLM approached inter-oncologist variability. For exact therapy-line reconstruction, mean patient-level performance remained 9 percentage points lower than the second oncologist, although inter-LLM disagreement was similar to inter-oncologist disagreement. All four LLMs tested outperformed the research coordinators on systemic therapy abstraction. Recurrence-free survival, overall survival, and hazard ratio estimates were similar between expert-derived and LLM-derived datasets. In an external cohort of 97 young patients with early-stage breast cancer, the unmodified pipeline showed similar performance for recurrence detection and adjuvant endocrine therapy use. Conclusions: Off-the-shelf LLMs in a fixed retrieval pipeline were able to abstract a range of variables from complex longitudinal oncology records with performance approaching inter-oncologist variability for key tasks, without any fine-tuning or institution-specific retraining. This approach offers a practical path to scaling the creation of research-grade retrospective datasets from narrative medical records.
Soltanifar, M.; Portuguese, A. J.; Jeon, Y.; Gauthier, J.; Lee, C. H.
Oncology research and clinical practice in North America increasingly rely on complex endpoints, heterogeneous study designs, and high-dimensional molecular data. In this landscape, data visualization serves as a critical analytic instrument for study design communication, model diagnostics, safety reporting, and real-time clinical decision support. Despite its importance, the oncology visualization ecosystem remains fragmented across commercial platforms and bespoke scripts, lacking a unified, code-first reference that emphasizes reproducibility and auditability in the R programming environment. This paper addresses this gap by presenting a North American collaborative atlas of 62 oncology visualization templates: 24 for clinical trials, 12 for real-world evidence (RWE), and 26 common to both settings. A core innovation of this atlas is its simulation-driven approach; each plot is illustrated using transparent, reproducible data-generating mechanisms. This allows users to deterministically recreate figures and easily adapt templates to alternative endpoints, censoring patterns, and subgroup structures. The paper provides foundational notation for oncology endpoints, an operational taxonomy based on data geometry, and a consolidated review of relevant R software. We further synthesize the practical utility of these methods through four representative case studies and provide a comparative analysis of the strengths, limitations, and future challenges of oncology data visualization. A detailed tutorial on fishplot is included to demonstrate a publication-ready workflow for clonal evolution.
Petalcorin, M. I. R.
Background: Modern oncology development depends on integrating radiographic response, molecular biomarkers, treatment exposure, safety, and survival endpoints, yet access to well-structured patient-level trial data is often limited. Methods: We developed a synthetic, literature-informed phase II randomized oncology trial framework that followed the sequence Patient → Data → Dataset → Analysis → Tables/Figures → Decision. A cohort of randomized patients was simulated with baseline demographic and disease features, longitudinal tumor measurements, circulating tumor DNA, inflammatory and exploratory biomarkers, adverse events, treatment exposure, and survival outcomes. Raw source datasets were transformed into SDTM-like domains and ADaM-like analysis datasets, then analyzed for baseline characteristics, exposure, best overall response, survival, subgroup hazard ratios, longitudinal tumor and biomarker changes, exposure-response, and safety. Results: The treatment arm showed a coherent efficacy signal across multiple analytical layers. Treatment increased objective response and clinical benefit, reduced tumor burden over time, and prolonged survival. Median overall survival increased from 135 days in the control arm to 288 days in the treatment arm, with an approximate hazard ratio of 0.661 (95% CI, 0.480-0.911; p = 0.011). Median progression-free survival increased from 116 to 208 days, with an approximate hazard ratio of 0.601 (95% CI, 0.418-0.864; p = 0.006). Circulating tumor DNA showed a more favorable trajectory in treated patients and aligned directionally with radiographic and survival benefit. Safety analyses showed increased treatment-related toxicity, but the overall safety profile remained interpretable and compatible with continued development. 
Conclusions: This study demonstrates that a synthetic, literature-informed oncology trial can reproduce a biologically plausible and analytically coherent efficacy-safety signal architecture across radiographic, molecular, and time-to-event endpoints, providing a decision-oriented prototype for translational oncology clinical data science. Keywords: synthetic clinical trial, oncology, ctDNA, Kaplan-Meier, biomarker, survival analysis, translational data science, ADaM, SDTM
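Median survival figures such as those reported above are typically read off a Kaplan-Meier curve. The following is a minimal sketch of the product-limit estimator, written with the standard library only; the follow-up times and event flags are invented for illustration and do not come from the preprint, and the function names are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] steps of the product-limit estimate.

    events[i] is 1 if the event (e.g. death) was observed at times[i],
    and 0 if the patient was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone who leaves the risk set at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which estimated survival drops to 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Invented follow-up data (days); 0 marks a censored observation.
times = [100, 150, 200, 250, 300, 350]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(times, events)
print(curve)
print(median_survival(curve))
```

Censored patients reduce the risk set without producing a step in the curve, which is why the estimate differs from a simple proportion of deaths.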
Dua, A.; Obermeyer, Z.; Butte, A. J.; Zack, T.
Background: FOLFIRINOX is a cornerstone regimen for eligible patients with pancreatic ductal adenocarcinoma (PDAC), but its clinical benefit is limited by substantial toxicity and frequent dose modification. In real-world practice, dose modifications are often individualized, and the clinical factors associated with these decisions remain incompletely characterized. Objective: To develop and evaluate an electronic medical record (EMR)-based machine-learning framework for modeling cycle-specific FOLFIRINOX dose modification decisions in patients with PDAC. Methods: We included patients with PDAC who received FOLFIRINOX at UCSF oncology clinics between November 2011 and December 2023. Predictors included demographic, clinical, laboratory, and treatment variables derived from the EMR. Logistic regression, random forest, and XGBoost models were trained using group-based 5-fold cross-validation to predict cycle-specific dose modifications for 5-fluorouracil, irinotecan, and oxaliplatin. Model performance was evaluated using area under the receiver operating characteristic curve. Results: The cohort included 514 patients receiving FOLFIRINOX across 5,041 treatment cycles. The mean age was 59 years, 60% of patients were White, 41% had a history of smoking, and patients received a median of 6 chemotherapy cycles. More than 60% of patients required at least one dose modification during treatment. XGBoost demonstrated the highest performance across component drugs, with AUCs ranging from 0.53 to 0.70. Clinically plausible predictors of irinotecan and oxaliplatin dose modification included hepatic and renal function markers, cumulative drug exposure, treatment-related symptoms, and demographic or behavioral characteristics. Conclusion: We developed an EMR-based machine-learning framework to model real-world FOLFIRINOX dose modification and identified clinically plausible, routinely available predictors, particularly for irinotecan and oxaliplatin. 
Variable model performance suggests that dosing decisions are only partially captured by structured EMR data, highlighting both the limitations of current data-driven approaches and clinical domains where ML-based models may support individualized dosing and toxicity surveillance. Future informatics efforts should incorporate dose-modification rationale, patient-reported and functional outcomes, and validation across diverse practice settings.
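Group-based cross-validation matters here because each patient contributes many treatment cycles; if cycles from one patient land in both the training and test folds, performance estimates are inflated. The sketch below shows one way to build such folds with the standard library. The abstract does not state which implementation the authors used (scikit-learn's `GroupKFold` is a common choice), so the function `group_k_fold` and the example patient labels are illustrative assumptions.

```python
from collections import defaultdict


def group_k_fold(groups, k):
    """Yield (train_idx, test_idx) pairs so that no group spans both sides.

    groups[i] is the group label (e.g. patient ID) of row i.
    """
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)

    # Greedy balancing: place the largest groups first into the smallest fold.
    folds = [[] for _ in range(k)]
    for group in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[group])

    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for j in range(k) if j != f for i in folds[j])
        yield train, test


# Invented example: one entry per treatment cycle, labelled by patient.
patient_of_cycle = ["A", "A", "A", "B", "B", "C", "D", "D", "E"]
splits = list(group_k_fold(patient_of_cycle, 3))
for train, test in splits:
    print("test patients:", sorted({patient_of_cycle[i] for i in test}))
```

Every cycle appears in exactly one test fold, and a patient's cycles always stay together.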
Goel, K. P.; Myall, N. J.; Dickerson, J.; Caswell-Jin, J. L.; Johnson, T.; Worth, J. E.; Gensheimer, M. F.
PURPOSE: To develop and validate an artificial intelligence-enabled platform that converts unstructured cancer trial eligibility criteria into structured queries and quantifies trial eligibility across advanced/metastatic cancer trials. METHODS: We downloaded actively recruiting US interventional treatment trials for advanced/metastatic breast cancer, colon cancer, and non-small cell lung cancer from ClinicalTrials.gov. Medical oncologists created 24 synthetic patient vignettes. A large language model converted trial eligibility criteria into Structured Query Language (SQL) code and patient information into structured records, enabling automated matching. Cancer details and treatment history were considered, but not laboratory results or comorbidities. Validation included physician editing of generated eligibility code for 30 trials, and blinded physician eligibility assessment for five trials. We then evaluated how age, ECOG performance status, sex, and ZIP code affected the number of eligible trials. RESULTS: Of 833 candidate trials, 746 met inclusion criteria. In physician review of 30 trials, edits to generated SQL did not change any of 720 trial-patient eligibility determinations for 24 synthetic patients. In blinded validation across 120 trial-patient pairs, automated matching achieved 97% accuracy. Across synthetic patients, eligible trials ranged from 31 to 258 when there were no geographic restrictions. Eligibility decreased markedly with worse performance status and with geographic restriction (both p<0.001). Later-phase, randomized, and molecularly selective trials had fewer eligible patients. CONCLUSION: AI-based structuring of trial eligibility criteria can support accurate, scalable measurement of potential cancer trial eligibility. In this demonstration, performance status, geography, and age were major determinants of eligibility across the active metastatic trial landscape.
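To make the criteria-to-SQL idea concrete, the sketch below runs hand-written eligibility queries against a small in-memory table of structured patient records. Everything in it is a hypothetical illustration: the table schema, the trial identifiers `TRIAL-A` and `TRIAL-B`, the criteria, and the vignette values are assumptions, not the platform's actual schema or any real ClinicalTrials.gov entry. In the study the SQL was generated by a language model; here it is written by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (patient_id TEXT, cancer_type TEXT, stage TEXT, "
    "ecog INTEGER, age INTEGER, prior_lines INTEGER)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("V01", "NSCLC", "IV", 1, 64, 1),
        ("V02", "NSCLC", "IV", 3, 71, 2),
        ("V03", "breast", "IV", 0, 52, 0),
    ],
)

# Hypothetical structured form of free-text criteria such as
# "Stage IV NSCLC, ECOG 0-1, age 18 or older, at most 2 prior lines".
trial_queries = {
    "TRIAL-A": (
        "SELECT patient_id FROM patients WHERE cancer_type = 'NSCLC' "
        "AND stage = 'IV' AND ecog <= 1 AND age >= 18 AND prior_lines <= 2"
    ),
    "TRIAL-B": (
        "SELECT patient_id FROM patients WHERE cancer_type = 'breast' "
        "AND stage = 'IV' AND ecog <= 2"
    ),
}


def eligible_trials(conn, trial_queries):
    """Map each patient ID to the list of trials whose query selects them."""
    matches = {}
    for trial_id, sql in trial_queries.items():
        for (patient_id,) in conn.execute(sql):
            matches.setdefault(patient_id, []).append(trial_id)
    return matches


matches = eligible_trials(conn, trial_queries)
print(matches)
```

Once criteria are queries, counting eligible trials per patient, or re-running the count after changing ECOG status or age, reduces to editing a row and executing the same SQL again.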
Hughes, N.; Hogenboom, J.; Carter, R.; Norman, L.; Gouthamchand, V.; Lindner, O.; Connearn, E.; Lobo Gomes, A.; Sikora-Koperska, A.; Rosinska, M.; Pogoda, K.; Wiechno, P.; Jagodzinska-Mucha, P.; Lugowska, I.; Hanebaum, S.; Dekker, A.; van der Graaf, W.; Husson, O.; Wee, L.; Feltbower, R.; Stark, D.
Background: Population-based cancer registers (PBCR) are important for monitoring trends in cancer epidemiology, facilitating the implementation of effective cancer services. Adolescents and Young Adults (AYA) with cancer are a patient group with a unique set of needs. The utility of PBCR in AYA is limited by the lack of AYA-specific data items. STRONG AYA, an international multidisciplinary consortium, is addressing this through federated learning (FL) methodology and novel data visualisation concepts. A Core Outcome Set (COS) has been developed to measure outcomes of importance through clinical data and Patient Reported Outcomes (PROs). We describe how data from the Yorkshire Specialist Register of Cancer in Children and Young People (YSRCCYP), a PBCR in the UK, are being used within STRONG AYA and how the subsequent analyses can guide patient consultations. Methods: Data from the YSRCCYP were imported into a Vantage 6 node, from which FL analyses are performed along with data provided by other consortium members. The results are extracted into the PROMPT software and integrated into patient electronic healthcare records. Results: Healthcare professionals can view the results of individual PROs at various time points and in comparison to summary analyses carried out within the STRONG AYA infrastructure. Results can be filtered by age, disease, country and stage. Conclusion: We have demonstrated how a regional PBCR can contribute to a pan-European infrastructure and how the resulting analyses can be viewed to enhance patient consultations. Such analyses have the potential to be used for research and policy-making, improving outcomes for AYA.
Kordes, M.; Chakravarty, D.; Boberg, E.; Creignou, M.; de Petris, L.; Karlsson, C.; Burstrom, L. L.; Suehnholz, S.; Yachnin, J.; Wiklander, O. P.; Haglund de Flon, F.
Background. The European Society for Medical Oncology (ESMO) Scale for Clinical Actionability of molecular Targets (ESCAT) ranks genomic alterations by the evidence supporting the predictive value of the molecular target for response to targeted therapies. No openly available, systematically curated set of standard care biomarkers mapped to the ESCAT framework exists to support clinical decision-making or harmonize biomarker interpretation. Methods. We mapped all OncoKB™ Level 1 biomarkers to ESCAT tiers using evidence cited by OncoKB™, excluding abstract-only data. Eight board-certified oncologists and hematologists independently assigned ESCAT tiers, with discrepancies resolved through structured consensus meetings. Recurring evidence scenarios that did not correspond to any existing ESCAT tier informed a set of a priori defined modifications, which were subsequently applied to biomarkers that could not be classified using native ESCAT criteria. Results. Of 188 OncoKB™ Level 1 biomarkers, 16 were excluded due to abstract-only evidence. Using native ESCAT criteria, 51% of the remaining biomarkers were classified as Tier 1, 3% Tier 2, 18% Tier 3, 6% Tier X and 22% could not be assigned to any tier. Applying the modified ESCAT criteria resolved all previously unclassifiable biomarkers and increased Tier 1 assignments to 73%. Inter-rater reliability (Krippendorff's alpha) was moderate (0.586) and 62% of classifications required consensus discussions. Comparison with ESCAT tiers reported in ESMO Clinical Practice Guidelines showed improved concordance when using the modified criteria. Conclusions. The native ESCAT criteria are highly stringent, resulting in many FDA-recognized, clinically validated biomarkers that are currently assigned Level 1 by OncoKB™ not mapping to any existing tier. Our predefined modifications improved alignment with OncoKB™ Level 1 designations and with published ESMO Clinical Practice Guidelines. 
The mapped set of standard care biomarkers is provided on the OncoKB™ website, offering a practical resource that harmonizes ESCAT tiers of evidence with a widely adopted levels-of-evidence schema.
Balaji, S.; Campbell, K.; Chen, R.-Z.; Smith, D. G.; Reyna, M. A.; Sarker, A.; Wallach, J. D.; Parikh, R. B.; Bozkurt, S.
Background: Identification of metastasis status in non-small cell lung cancer (NSCLC) is a critical part of understanding disease prognosis, treatment courses, trial eligibility, and population-level cancer surveillance. However, metastasis status is inconsistently recorded in structured cancer registry fields, since manual abstraction of clinical notes is often a resource-intensive and error-prone process. This challenge highlights an opportunity for leveraging large language models (LLMs) to conduct high-scale metastasis extraction from real-world clinical documentation. Objective: We conducted a retrospective, multi-cohort comparative evaluation of three distinct LLMs for two independent classification tasks: overall metastasis presence at any site and brain/CNS metastasis presence. We evaluated model performance on two independent NSCLC cohorts: (1) a registry-linked cohort used for model development and validation and (2) an independent cohort with manual note-level annotations for additional validation. We further explored whether our methods could analyze clinical documentation and recover missing or outdated metastasis information in structured registry labels. Methods: Patient cohorts were derived from the Winship Cancer Institute. Cohort 1 (n=579 patients; 24,887 notes across 69 note types; 2023-2025) used registry-linked metastasis fields as the reference standard. Cohort 2 (n=22 patients; 644 radiology notes; 2010-2021) was drawn from two completed randomized trials and used dual-annotator manual labels (Cohen's κ: 0.93 for overall metastasis, 0.88 for CNS metastasis) as the reference standard. We fine-tuned the GatorTron-base encoder model separately for each binary classification task. We evaluated MedGemma-27B-text and Llama 3.1-70B using zero-shot prompting. 
A separate cohort of 675 patients with missing or unknown registry labels was used for an exploratory missingness-recovery analysis, validated against manual annotations of a random subsample. Results: More than half (54%) of initially identified Cohort 1 patients had missing or unknown registry metastasis labels. For overall metastasis classification, fine-tuned MedGemma demonstrated the best performance (Cohort 1: F1=0.80, Cohort 2 patient level: F1=1.0, Cohort 2 note level: F1=0.93). For brain/CNS metastasis, Llama 3.1-70B performed best in both cohorts (Cohort 1: F1=0.79, Cohort 2 patient level: F1=0.93, Cohort 2 note level: F1=0.86). The fine-tuned GatorTron model showed strong performance for classification of overall metastasis in Cohort 1 (F1=0.72). Error analysis indicated that most model errors reflected incomplete registry labels, ambiguous clinical language, or missing documentation rather than true model errors. In the exploratory recovery analysis, model predictions agreed with manual annotations at accuracy=0.90 and F1=0.89. Conclusions: All models demonstrated relatively high performance. The zero-shot generative models were more robust to nuanced documentation and context-dependent brain/CNS metastasis extraction. The fine-tuned encoder model demonstrated strong classification performance but may have been limited by potential inaccuracies in the registry reference standards during model training. This study further demonstrated the potential of LLMs in recovering clinically plausible structured labels from narrative text, complementing cancer registries for metastasis ascertainment.
Shim, K. B.
Pancreatic ductal adenocarcinoma (PDAC) remains one of the deadliest solid tumors and continues to face low treatment-trial participation, fragmented evidence workflows, and labor-intensive abstraction of unstructured clinical text. Existing oncology-focused language models show promise, but many depend on private institutional corpora, limiting reproducibility and practical reuse across centers. We present Onca, an open 9B dense model designed for four PDAC-relevant tasks: trial eligibility screening, case-specific clinical reasoning, structured pathology report extraction, and molecular variant evidence reasoning. Onca is fine-tuned from Qwopus3.5-9B-v3 with a single Unsloth BF16 LoRA adapter on 37,364 training rows drawn from openly available sources. The evaluation spans 11 panels and compares Onca against Woollie-7B, CancerLLM-7B, OpenBioLLM-8B, and the unmodified Qwopus base. Onca achieves the strongest overall results on Trial Screening (81.6 F1), Clinical Reasoning (14.1 composite), Pathology Extraction (30.5 field exact-match), PubMedQA Cancer (68.3 macro-F1), and PubMedQA (66.5 macro-F1). The strongest gains appear in tasks closest to routine oncology workflow, especially trial review and pathology structuring. These findings suggest that clinically targeted pancreatic-cancer language models can be built from open data with competitive performance while remaining practical to train on a single workstation-scale GPU setup.
Guillot, J.; Miao, B.; Suresh, A.; Sushil, M.; Williams, C. Y.; Vashisht, R.; Oskotsky, T. T.; Sirota, M.; Butte, A. J.
Chimeric Antigen Receptor T-cell (CAR-T) therapy, where genetically engineered patient T cells target tumor antigens, has transformed care for hematologic malignancies but requires careful tracking of adverse events (AEs) often documented only in unstructured EHR notes. We evaluated a Large Language Model (LLM)-based approach in UCSF's secure environment to extract AEs, dates, grades, and interventions within 30 days post-infusion for six commercial CAR-T products (2012-2023), benchmarking against two evaluators. Using GPT-4-0314 in a zero-shot setting with four prompts (prespecified AEs, non-prespecified AEs, CRS, ICANS), we compared outputs against dual annotations on a random sample of 50 notes using accuracy, precision, recall, F1, and Cohen's kappa. From 4,762 progress notes for 293 patients (median age 65.6), CRS occurred in 80.2% (median onset 4 days); neutropenia 70.0% (16 days); neutropenic fever 64.8% (4 days); ICANS in 34.8%. Interventions included tocilizumab and corticosteroids. Grades were frequently undocumented (CRS 62.3%, ICANS 56.1%); documented cases were mainly CRS grade 1 (59.4%) and ICANS grade 2 (28.0%). Performance was high on CRS and ICANS grading (accuracy of 0.97 and 0.91, respectively). Performance was moderate for extraction of prespecified AEs (accuracies 0.62-0.76) and non-prespecified AEs (accuracies 0.76-0.84). Inter-rater reliability was strong to near-perfect for CRS/ICANS presence and grade (kappa 0.86-0.96), moderate for dates and interventions, and weaker for broader AE attributes. LLM-derived insights can augment AE monitoring and real-world evidence generation by unlocking unstructured clinical detail and characteristic timelines after CAR-T therapy. However, performance varied for broader AE attributes, warranting cautious use. Performance was highest for detecting the presence and grade of CRS and ICANS, with strong to near-perfect inter-rater reliability. 
While cautious use of LLMs for broad AE extraction is warranted due to the variable performance observed in this study, these results support integrating high-performing CRS/ICANS extraction into EHR workflows. Author summary: Chimeric Antigen Receptor T-cell (CAR-T) therapy has transformed care for blood cancer but requires careful tracking of adverse events (AEs). We asked whether a large language model could read routine clinical notes and extract AEs after CAR T-cell therapy. We analyzed de-identified notes from the first month after infusion. The model identified when two key side effects occurred--cytokine release syndrome (a whole-body inflammatory reaction) and neurotoxicity (brain and nerve symptoms)--and how severe they were, with accuracy similar to human reviewers. It also captured when side effects started and what treatments were given, though performance was more variable for the wider range of side effects beyond these two. In our data, these reactions often arose within the first week; blood count problems and infections were also common. Because many notes did not state severity explicitly, the model sometimes could not assign a grade. Our findings suggest that language models can help unlock important details hidden in clinical notes and could be incorporated into electronic records to support faster, more reliable side-effect monitoring and research. We recommend careful, supervised use and continued validation, especially for broader side-effect categories.
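Cohen's kappa, used above for inter-rater reliability, corrects raw agreement for the agreement two raters would reach by chance. The sketch below implements the standard two-rater formula with the standard library; the ten example labels are invented for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented example: 1 = CRS documented in the note, 0 = not documented.
annotator_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
annotator_2 = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this example the raters agree on 8 of 10 notes, but with balanced labels half of that agreement is expected by chance, so kappa is noticeably lower than the raw 0.8.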
Cannon, M. J.; Bratulin, A.; Kuzma, K.; Puthawala, D.; Corsmeier, D.; Schieffer, K.; Kelly, B.; Cottrell, C.; Wagner, A. H.
Genomic medicine relies on expert evaluation of genomic variants, but this process is dramatically slowed by a lack of readily accessible genomic knowledge. Although genomic knowledge resources such as ClinVar and CIViC support structured data sharing and provide interfaces for adding structure, much of the variant interpretation data generated upstream of these resources is not readily interoperable with them, limiting the ability of clinical labs to share data and creating knowledge silos. Here we evaluate a strategy for breaking down these knowledge silos in a pilot study to transform semi-structured variant classification knowledge into computable clinical assertions leveraging the Global Alliance for Genomics and Health (GA4GH) Genomic Knowledge Standards specifications. We programmatically mapped previously captured somatic cancer clinical significance classifications from spreadsheets to the GA4GH Variant Annotation specification. For diagnostic classification data, this approach enabled reuse of standards-aware submission tooling to share 1,499 records to ClinVar. We then studied whether AI-assisted curation could overcome gaps in unstructured text and enable scalable curation of prior classifications. Using this approach, we were able to accurately classify clinical significance for 71.8% (117/163) of randomly sampled prognostic evidence statements. We conclude with an overview of how this work may be generalized to make computationally inaccessible variant evidence from other clinical laboratories broadly reusable in downstream knowledgebases such as CIViC and ClinVar.
Enikeev, R.; Moldovan, M.; Chu, M.; Amalraj, A.; Koli, P. P.; Abdul, S. S.; Sivaraj, H.; Iqbal, U.; Toh, C. K.
Background: Structuring oncology clinical notes into registry-grade variables is essential for research and care but remains labour-intensive and error-prone. Objective: To develop and evaluate a privacy-preserving large language model pipeline for oncology registry abstraction in a real-world clinical setting. Methods: We deployed an open-source Meta Llama 3.3 70B-based pipeline to extract over 50 variables from 6,700 oncology notes at a cancer centre in Singapore. Data were de-identified locally using a Hide-In-Plain-Sight approach, ensuring no identifiable data left hospital infrastructure. Performance was assessed on 200 randomly sampled notes with adjudicated ground truth. A structure-aware framework classified outputs as correct, missing, spurious, or incorrect. Results: F1 scores were high across variables, including diagnosis (97.2%), histology (95.8%), stage (92.6%), biomarkers (91.4%), and treatments (88.1%). Transferability testing on 50 external notes showed strong performance for core variables. Conclusions: Privacy-preserving LLMs can achieve near-human-level accuracy for oncology abstraction, with structure-aware evaluation enabling more clinically meaningful assessment. Keywords: Oncology Registry Abstraction, Privacy-Preserving Deployment, Clinical Information Extraction, Structure-Aware Evaluation, Large Language Models, Template-Filling Metrics
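The four-way classification of extracted fields (correct, missing, spurious, incorrect) can be turned into precision, recall, and F1 in the manner of template-filling evaluation. The abstract does not give the exact scoring rules, so the sketch below is an assumption: it scores exact string matches per field, counts incorrect values against both precision and recall, and uses hypothetical field names and values.

```python
from collections import Counter


def classify_fields(gold, predicted):
    """Label each field as correct, incorrect, missing, or spurious."""
    labels = {}
    for field in sorted(set(gold) | set(predicted)):
        gold_value = gold.get(field)
        pred_value = predicted.get(field)
        if gold_value is None and pred_value is None:
            continue
        if pred_value is None:
            labels[field] = "missing"      # in the reference, not extracted
        elif gold_value is None:
            labels[field] = "spurious"     # extracted, not in the reference
        elif gold_value == pred_value:
            labels[field] = "correct"
        else:
            labels[field] = "incorrect"    # extracted with the wrong value
    return labels


def template_f1(labels):
    """Precision, recall, and F1 from the four-way field labels."""
    counts = Counter(labels.values())
    correct = counts["correct"]
    predicted_total = correct + counts["incorrect"] + counts["spurious"]
    gold_total = correct + counts["incorrect"] + counts["missing"]
    precision = correct / predicted_total if predicted_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented reference and model output for a single note.
gold = {"diagnosis": "lung adenocarcinoma", "stage": "IIIA",
        "histology": "adenocarcinoma", "biomarker_egfr": "positive"}
predicted = {"diagnosis": "lung adenocarcinoma", "stage": "IIIB",
             "histology": "adenocarcinoma", "treatment": "osimertinib"}

labels = classify_fields(gold, predicted)
print(labels)
print(template_f1(labels))
```

Separating missing from spurious fields shows whether a pipeline errs by omission or by hallucination, which a single accuracy figure would hide.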
McPhaul, T.; Kreimeyer, K.; Baris, A.; Botsis, T.
Cancer data standardization requires converting unstructured pathology reports into structured registry variables, a mostly manual and resource-intensive task. We evaluated two automated extraction platforms: Brim Analytics, an LLM-based system that guides and orchestrates abstraction, and DeepPhe, an ontology-driven system. Using 330 pancreatic adenocarcinoma and 34 breast cancer pathology reports from Johns Hopkins Hospital, we assessed both under deployment-realistic conditions. Brim Analytics achieved high accuracy across seven registry variables in pancreatic cancer (mean 96.7%), including T stage (96.4%) and histologic grade (97.0%), with a 3.0 p.p. decline on breast cancer (mean 93.7%). DeepPhe performed comparably for N stage (96.4% pancreatic, 94.1% breast) but had notable T stage deficits (83.6% pancreatic, 70.6% breast). Per-report processing times averaged 0.9 s (Brim, pancreatic), 4.6 s (Brim, breast), 1.1 s (DeepPhe, pancreatic), and 3.5 s (DeepPhe, breast). These results indicate that LLM-based extraction can achieve high accuracy across cancer types and support automated data workflows.
Walinjkar, A.
Background: Circulating tumour DNA (ctDNA) liquid biopsy is now established across oncology for early cancer detection, minimal residual disease surveillance, and treatment monitoring. Detection thresholds for all current ctDNA assays are derived empirically through receiver operating characteristic analysis on training cohorts - a statistically valid but theoretically uninformed approach that does not specify the minimum detectable tumour fraction given assay technical characteristics, nor identify when increasing sequencing depth ceases to provide additional clinical information. Methods: We model ctDNA detection as a binary hypothesis testing problem with Binomial-distributed mutant allele counts against a sequencing error noise floor. The Neyman-Pearson lemma is applied to derive the uniformly most powerful detector and the minimum detectable tumour fraction in closed form. The sequencing assay is modelled as a binary symmetric channel and Shannon channel capacity is calculated. Empirical validation uses n=61 data points extracted from five published peer-reviewed analytical validation studies across five independent institutions in the US and EU (2018 - 2025): Yu et al. 2022, Stetson et al. 2018, Frydendahl et al. 2023, Northcott et al. 2024, and Cheng et al. 2025. Results: The minimum detectable tumour fraction is derived in closed form as f_min approximately equal to (z_alpha + z_beta) multiplied by the square root of (epsilon divided by N), where N is sequencing depth, epsilon is the platform error rate, and z_alpha, z_beta are standard normal quantiles at the specified false positive and false negative rates. Shannon channel capacity is C = 1 minus H(epsilon) bits per read, where H(epsilon) is binary entropy. Empirical validation yields 84.3% agreement for single-locus assays. 
Discordance for multi-locus tumour-informed assays (NeXT Personal, duplex WGS) is consistent with the single-locus model scope and identifies the principal theoretical extension required. Conclusions: This framework provides the first formal Neyman-Pearson optimality proof for ctDNA detection, a closed-form detection limit, and a platform-independent efficiency metric for NHS and regulatory standardisation. Keywords: circulating tumour DNA; liquid biopsy; Neyman-Pearson detection; Shannon channel capacity; sequencing depth; limit of detection; minimal residual disease; signal detection theory
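The two closed-form expressions quoted in this abstract can be evaluated directly. The sketch below is illustrative only: it assumes z_alpha and z_beta are one-sided upper standard normal quantiles, and the depth, error rate, and error-probability values are hypothetical inputs rather than figures from the preprint.

```python
from math import log2, sqrt
from statistics import NormalDist


def min_detectable_fraction(depth: int, error_rate: float,
                            alpha: float, beta: float) -> float:
    """f_min ~ (z_alpha + z_beta) * sqrt(epsilon / N), as stated in the abstract.

    Assumption: z_alpha and z_beta are one-sided upper quantiles.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    return (z_alpha + z_beta) * sqrt(error_rate / depth)


def bsc_capacity(error_rate: float) -> float:
    """Capacity of a binary symmetric channel, C = 1 - H(epsilon), in bits per read."""
    if error_rate in (0.0, 1.0):
        return 1.0  # binary entropy is zero at the endpoints
    entropy = -(error_rate * log2(error_rate)
                + (1 - error_rate) * log2(1 - error_rate))
    return 1.0 - entropy


# Hypothetical assay: 10,000x depth, 0.1% per-base error, 5% false positive and false negative rates.
f_min = min_detectable_fraction(depth=10_000, error_rate=1e-3, alpha=0.05, beta=0.05)
capacity = bsc_capacity(1e-3)
print(f"f_min = {f_min:.2e}, capacity = {capacity:.4f} bits/read")
```

Because f_min scales with the inverse square root of N, quadrupling the sequencing depth halves the detection limit under this model.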
Roy, J.; Korleski, J. B.; Augustin, R. C.; Yefet, L.; Jensen, Z. D.; Ehman, E. C.; Zadeh, G.; Conners, A. L.; Tevaarwerk, A. J.; Korfiatis, P.
Background: Preparing tumor board patient summaries is time intensive. Large-language-model based systems may automate summarization but require real-world evaluation prior to clinical use. We performed an exploratory retrospective evaluation of the Microsoft Healthcare Agent Orchestrator (HAO), deployed in a Mayo Clinic controlled staged environment, to generate tumor board-style patient summaries from retrospective Electronic Health Record (EHR) notes. Methods: HAO generated summaries for breast, hepatobiliary, and neuro-oncology tumor board cases using up to the most recent 1,000 clinical notes. Clinician reviewers evaluated outputs via REDCap surveys across perceived factuality, completeness, clarity/conciseness, temporal cohesion, comparative performance, safety, and clinical utility (0-4 Likert scale). Reviewers were permitted to query the HAO chat interface to address missing details. Automated factuality was assessed using TBFact (bidirectional entailment), reporting precision and recall against available reference summaries. Results: Among 57 survey responses from 5 different physicians, mean scores exceeded 2.8 across domains, with medians of 3 for most axes. In an exploratory comparison, oncology fellows required less time to review HAO-generated summaries than to manually generate patient summaries (mean difference 13.57 minutes per patient, p<0.001), although this difference may be influenced by prior familiarity with the same cases; 96% of survey responses indicated that HAO would save time. TBFact evaluations showed higher recall than precision across domains, consistent with broad capture of reference content alongside additional content that was not present in gold-standard summaries. Attribution was viewed favorably but showed issues with primary-source specificity and link reliability. Conclusions: In a controlled Mayo environment, HAO demonstrated moderate performance and was associated with reduced review time for tumor board preparation. 
These findings are promising but preliminary and do not establish clinical safety, noninferiority to manual review, or readiness for routine clinical use. Limitations, including verbosity, specialty-specific content gaps, and inconsistent attribution, highlight the need for iterative refinement and further evaluation.
Sharma, T.; Chopra, A. P.; Agrawal, L.; Verma, N. K.; Starlard-Davenport, A.; Wang, J.; Hayes, D. N.; Cui, Y.
Purpose: Machine learning (ML) models for omics-based cancer prognosis are often trained on data from predominantly European-ancestry populations, producing biased predictions for other populations and undermining equitable genomic medicine. Existing fairness benchmarks mainly focus on outcome parity rather than predictive performance parity across populations. Public benchmark resources are needed for systematically detecting and mitigating such performance disparities in multi-population cancer prognosis. Methods: We developed Equitable Health Intelligence (EHI, https://ehiportal.org), an open-source benchmark of multi-population ML for omics-based cancer prognosis. EHI contains 1,475 ML tasks across 40 cancer/pan-cancer types, 4 omics feature sets, 4 clinical endpoints, 5 event-time thresholds, and 3 data-disadvantaged population (DDP) groups relative to a majority European-ancestry population group. Deep neural network models are trained under three multi-population ML schemes (Mixture, Independent, and Transfer Learning), with Naive Transfer included as a no-adaptation control, comprising a total of 10,325 ML experiments. Results: The EHI platform provides an interactive environment with visualization and exploratory tools for users to inspect predictive performance disparities between the majority European-ancestry group and data-disadvantaged populations, evaluate the extent to which transfer learning mitigates these disparities, and examine the impact of feature engineering methods across cancer types, omics features, and clinical endpoints. Conclusion: EHI is an open, interactive, and extensible benchmark for identifying and addressing performance disparities in multi-population ML for omics-based cancer prognosis. It provides a foundation for a growing ecosystem of methods targeting ML performance disparities arising from biomedical data inequality and population-level distribution shifts, thereby advancing equitable AI in precision oncology.
Bouteiller, J.; Gryspeert, A.-R.; Caron, J.; Polit, L.; Altay, G.; Cabantous, M.; Pietrzak, R.; Graziosi, F.; Longarini, M.; Schutte, K.; Cartry, J.; Mathieu, J. R.; Bedja, S.; Boileve, A.; Ducreux, M.; Pages, D.-L.; Jaulin, F.; Ronteix, G.
Background: Predicting whether a treatment will demonstrate meaningful clinical benefit before committing to a large-scale trial remains a major unmet need in oncology. Patient-derived organoids (PDOs) recapitulate individual tumor drug sensitivity, but have not been used to forecast population-level trial outcomes. We developed SCOPE (Screening-to-Clinical Outcome Prediction Engine), a platform that integrates PDO drug screening with clinical prognostic modeling to predict arm-level median progression-free survival (mPFS) and objective response rate (ORR) without access to any trial outcome data. Patients and methods: SCOPE was trained on 54 treatment lines from 52 independent patients with metastatic colorectal cancer (mCRC, n=15) and metastatic pancreatic ductal adenocarcinoma (mPDAC, n=39) with matched clinical data and PDO drug screening across 9 compounds. A Clinical Score module captures baseline prognosis; a Drug Screen Score module quantifies treatment-specific organoid sensitivity. To predict trial outcomes, synthetic patient profiles are generated from published eligibility criteria and matched to a biobank of 81 PDO lines. Predictions were externally validated against 32 arms from 23 published trials, treatment ranking was assessed across 8 head-to-head comparisons, and prospective applicability was tested for Daraxonrasib (RMC-6236), a novel pan-RAS inhibitor in mPDAC. Results: Predicted mPFS strongly agreed with published outcomes (R2=0.85, MAE=0.82 months; Pearson r=0.92, P<0.001), approaching the empirical concordance between two independently measured clinical endpoints (ORR vs. mPFS, R2=0.87). ORR prediction was similarly robust (R2=0.71, MAE=7.3 percentage points). Integrating organoid and clinical data significantly outperformed either alone (P=0.001). SCOPE correctly identified the superior arm in 7 of 8 head-to-head comparisons (88%, P<0.05). 
Applied to Daraxonrasib prior to phase 3 data availability, the platform predicted superiority over standard chemotherapy in KRAS-mutant mPDAC, consistent with emerging trial data. Conclusion: By combining functional organoid drug screening with clinical modeling, SCOPE generates calibrated efficacy predictions for both established regimens and novel agents without prior trial data. This approach could support clinical trial design, treatment arm selection, and go/no-go decisions, offering a new tool to improve the efficiency of gastrointestinal cancer drug development.
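The abstract does not name the test behind the P value reported for the 7 of 8 head-to-head comparisons. As a plausibility check only, the sketch below assumes a one-sided exact binomial test against a 50% chance of picking the superior arm; the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 7 or more correct rankings out of 8 under pure chance: (8 + 1) / 256.
p_value = binomial_upper_tail(7, 8)
print(f"one-sided p = {p_value:.4f}")  # 9/256, about 0.0352
```

Under that assumption the result falls below 0.05, which is consistent with the reported threshold; a two-sided version of the same test would give 18/256, about 0.070.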
Jiang, Y.; He, X.; Ai, X.; Jalal, S.; Maniar, R.; Majji, R. K.; Zhang, Y.; Liu, J.; Fedele, D.; Zhuang, Y.; Hollenbach, J.; Bian, J.
Clinical chart abstraction extracts structured patient variables from longitudinal clinical notes but is labor-intensive and difficult to scale. We evaluated LLM agents for question-guided chart review using lung cancer molecular testing guideline concordance as a use case. Two configurations were compared: (1) sequential note review using metadata and chronology, and (2) the same framework augmented with keyword-based note search. Gold-standard labels were established by human annotators. The search-enabled agent achieved higher accuracy (92.4% vs. 83.5%) and reduced errors by more than half (41 vs. 89) by retrieving evidence from long, heterogeneous note histories. In guideline concordance evaluation, most determinate patient-rule assessments were concordant (80.7%), while most apparent non-concordance reflected missing molecular testing documentation rather than documented care deviations. These results suggest tool-augmented LLM agents can approximate key aspects of human chart review and support scalable information extraction from longitudinal clinical documentation.
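The abstract does not describe how the keyword-based note search is implemented. The sketch below is one hypothetical shape such a tool could take: a case-insensitive substring search over a patient's notes that returns matching notes in chronological order with a short context snippet. The `Note` fields, the function name, and the example notes are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Note:
    note_id: str
    note_date: date
    note_type: str
    text: str


def search_notes(notes: list[Note], keywords: list[str], window: int = 60) -> list[tuple]:
    """Return (note_id, note_date, snippet) for each note containing any keyword.

    One snippet per matching note, taken around the earliest match in that note.
    Matching is plain substring search, so "alk" would also match "walk".
    """
    if not keywords:
        return []
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = []
    for note in sorted(notes, key=lambda n: n.note_date):
        match = pattern.search(note.text)
        if match:
            start = max(0, match.start() - window)
            end = match.end() + window
            hits.append((note.note_id, note.note_date, note.text[start:end]))
    return hits


# Hypothetical notes for one patient.
notes = [
    Note("n2", date(2023, 5, 1), "Pathology", "EGFR exon 19 deletion detected."),
    Note("n1", date(2023, 1, 10), "Progress", "Patient reports cough."),
    Note("n3", date(2023, 3, 2), "Oncology", "Plan: send alk and ROS1 testing."),
]
for note_id, note_date, snippet in search_notes(notes, ["EGFR", "ALK"]):
    print(note_id, note_date, snippet)
```

A tool like this lets an agent jump to the few notes that mention a biomarker instead of reading a long history sequentially, which matches the retrieval benefit the abstract attributes to the search-enabled configuration.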
Islam, T.; Danner, M.; Ziad, Z.; Begemann, M.; Beijer, D.; Lischka, A.; Lausberg, E.; Mattern, L.; Suh, J.; Wittig, P.; Guezel, N.; Schlaich, E.; Karaivanova, R.; D'Augello, S.; Franken, L.; Ruedebusch, J.; Mueller, R.; Perchalla, E.; Zempel, H.; Haag, N.; Eggermann, K.; Eggermann, T.; Meyer, R.; Kraft, F.; Elbracht, M.; Kurth, I.; Krause, J.
Background: Molecular medicine has made genetic diagnostics crucial for rare diseases, but the majority of patients remain without a diagnosis even after state-of-the-art assessment. Standardized systems for integrating clinical features, such as the Human Phenotype Ontology (HPO), offer assistance, but are often insufficiently detailed and fail to capture crucial clinical parameters such as age at onset, longitudinal changes in symptoms, detailed characteristics of a clinical symptom, or the absence of a feature. Results: We present Genosolver, an integrated workflow that uses machine learning to address this bottleneck. Applying Large Language Models (LLMs) and Large Reasoning Models (LRMs) to unstructured clinical notes and electronic health care data, the workflow unifies phenotype extraction, generates differential diagnoses, and prioritizes genetic variants from genome data. We evaluated performance on 233 previously genetically solved cases, where Genosolver ranked the causative gene first in 72% of cases and within the top 10 genes in 94% of cases, outperforming the existing benchmarking tool Exomiser by 9%. Semi-automated reanalysis of 1,875 unsolved rare disease cases yielded an additional diagnostic rate of 1.7%. Incorporating rich, unstandardized clinical narratives substantially enhanced model performance beyond HPO-only inputs, and data-security-compliant local models achieved competitive results. Conclusion: Integrating unstandardized clinical data with local LLMs and reasoning offers a scalable, data-secure workflow that increases molecular diagnoses in rare diseases.